Abstract



A typical protein structure is a compact packing of connected α-helices and/or β-strands.
We have developed a method for generating the ensemble of compact structures that a given
set of helices and strands can form. The method is tested on structures composed of four
α-helices connected by short turns. All natural four-helix bundles connected by short turns
are reproduced within the ensemble to closer than 3.6 Angstroms per residue. Since
structures with no natural counterpart may be targets for ab initio structure design, the
designability of each structure in the ensemble - defined as the number of sequences with
that structure as their lowest energy state - is evaluated using a hydrophobic energy. For
the case of four α-helices, a small set of highly designable structures emerges, most of
which have an analog among the known four-helix fold families; however, several novel
packings and topologies are identified.

I. INTRODUCTION



The number of proteins with structures in the Protein Data Bank continues to grow at 
an exponential rate. There is a great diversity of amino-acid sequences in these proteins, yet 
there is much less diversity in the structures themselves. Among currently known structures, 
only several hundred qualitatively distinct folds have been identified - indeed, it has been 
estimated that there are only about 1000 distinct protein folds in nature. Has nature
exhausted all possible folds? If not, how can we design proteins to adopt folds not seen in
nature?

Important progress has been made in designing natural folds "from scratch". Recently,
several attempts have been made to modify natural folds. Dahiyat and Mayo [10] were able to
design a zinc finger that no longer depended on a zinc ion for stability. Harbury et
al. [11] were able to design sequences of amino acids so that the superhelical twist of
coiled coils was right-handed, in contrast to the left-handed twist found in nature up to
that time [12]. Kortemme et al. successfully designed a three-stranded β-sheet protein [13].

Combinatorial experimental approaches to creating new protein structures are also possible.
Studies of the folding of random amino-acid sequences by Davidson and Sauer [14] identified
some sequences which appear to fold. However, the conformations were not sufficiently rigid
to allow structural determination by either X-ray crystallography or
nuclear-magnetic-resonance techniques to see if there were novel folds. Recently, Szostak
and colleagues [15] have been able to find folding proteins by in vitro evolution. This
method can be used to identify proteins which bind to a particular substrate. It gives the
ability to design for a certain function, but with no guarantee that the proteins found in
this way will be novel folds. Another powerful method to evolve for novel functions (or
potentially new folds) is in vitro DNA recombination [16]. But again, it has not been
applied to screening for new folds.

Theoretical approaches to the design of qualitatively new folds have followed two paths:
searching within structure space for new folds [17,18] and searching in sequence space for
sequences that lead to new folds [19,20]. The first approach has thus far relied on
enumerating protein backbones using a finite set of dihedral-angle pairs [21]. In this
approach, enumerating all backbones for proteins of length greater than 30 is
computationally intractable. Sampling methods can generate longer chains, but so far fail
to achieve realistic secondary structures. The second approach has been attempted using
several schemes. One involves enumerating helical structures using sequence-specific
contacts [19]. Another uses a library of sequences with known structure to assemble
possible structures that a given sequence may adopt. However, searching the large space of
sequences for potentially new folds is a huge computational challenge.

In this paper, we present a computational method for generating packings of secondary
structures which, we believe, will facilitate the search for novel protein folds and
complement the methods described above. Our method is motivated by the following
observations: Most naturally occurring protein structures are composed of two fundamental
building blocks, α-helices and β-strands. A typical protein structure is a packing of
helices and strands connected by turns. The helices and strands are stabilized by hydrogen
bonds, by tertiary interactions and by the high propensity of some amino acids to form
helices and of others to form strands [23]. Because some residues are hydrophobic, the
helices and strands pack together in a specific way to minimize the exposure of the
hydrophobic regions to water.

The packing of secondary structural elements, with the connecting turns cut away, is
generally known as a protein's "stack". This stack, plus information about which elements
are connected together by turns, yields the protein's fold [23]. Our method for generating
protein folds begins by first specifying a fixed number of α-helices and/or β-strands of
fixed lengths, and second systematically enumerating all of the possible stacks of these
elements. The great advantage of using fixed secondary structural elements is that one
freezes many of the degrees of freedom of the chain. The freezing of these elements can be
designed in by choosing amino acids with appropriate helical or strand propensities. (Loops
can later be used to connect the secondary structures.) To test our scheme for generating
stacks, we apply it to the packing of four α-helices. Four-helix bundles are a good test
case, as the natural bundles fall into a small number of fold families [27], and it has
proven possible to design four-helix bundles through a careful selection of
hydrophobic-polar sequences [7]. Our method is able to reproduce the four-helix-bundle
families in the Structural Classification Of Proteins (SCOP) database.

Within a set of stacks, those with no natural counterparts are potential candidates for the
design of novel protein folds. To identify promising candidates, we consider their
"designability". The designability of a structure is defined as the number of amino-acid
sequences which have that structure as their lowest energy conformation. In lattice models,
it has been shown that the sequences associated with highly designable structures have
protein-like properties: mutational stability [28,29], thermodynamic stability [29,30],
fast folding kinetics [28,31], and tertiary symmetry [32], [33]. Recently, off-lattice
studies of protein structures have also shown that certain backbone configurations are
highly designable, and that the associated sequences have enhanced mutational and
thermodynamic stability [17,18]. Hence, we aim to identify those stack configurations with
high designability, and without natural counterparts, as targets for novel structure
design. Several novel four-helix folds are identified.



II. RESULTS 

We applied our structure generation method (described in detail in Methods) to the packing
of four α-helices. We chose each helix to be 15 residues long¹ (each helix has a
periodicity of 3.6 residues and a rise of 1.5 Angstroms/residue). The backbones of turns
connecting the helices were not specified, but the turns were constrained to be short.
Specifically, we discarded a stack if any of the end-to-end distances between connected
helices exceeded 12 Angstroms. The method generated a "complete" ensemble of four-helix
stacks consisting of 1,297,808 structures (for a discussion of completeness, see Methods).
This large ensemble of structures was then clustered, resulting in 188,538 representative
structures.

To test whether the method reproduced the natural four-helix bundles, we selected 11
proteins with short turns, from different SCOP families, and searched our representative
structures for the best fits. To account for length differences between the helices in the
SCOP structures (the lengths ranged from 7 to 18 residues) and the 15-residue helices in
our model, we chose the shorter length for each comparison. For the longer helix of each
mismatched pair, we tried all possible truncations down to the shorter length. Thus, for
each pairing of a SCOP structure with one of our representative structures, we computed the
best fit among all possible combinations of truncations. Fig. 1 shows four overall best
fits among all possible pairings. For the 11 natural four-helix bundles, the average crms
to a representative structure was 2.86 Angstroms. Table I summarizes the results of fitting
the natural four-helix bundles to our representative structures. In all cases, the natural
structure had a counterpart in the representative ensemble at a crms distance of less than
3.6 Angstroms per residue.

¹ The procedure was also tested on the packing of shorter (10-residue) and longer
(20-residue) helices, with the short helices producing highly variable packings and the
longer helices tending to always pack into up-and-down configurations.
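
The truncation search described above can be illustrated with a short Python sketch. It is
not the code used in this work: the function names are hypothetical, and "all possible
truncations" is assumed here to mean every contiguous window of the shorter length, i.e.
trimming residues from either end of the longer helix.

```python
from itertools import product

MODEL_HELIX_LENGTH = 15  # residues per model helix (from the text)


def truncation_windows(length, target):
    """Return every contiguous (start, stop) window of `target` residues."""
    if target > length:
        raise ValueError("target length exceeds helix length")
    return [(start, start + target) for start in range(length - target + 1)]


def paired_truncations(natural_lengths, model_length=MODEL_HELIX_LENGTH):
    """Yield all combinations of truncations over a set of helix pairs.

    For each pair the shorter length is kept and the longer helix is cut to
    every window of that length. Each yielded tuple holds one
    (natural_window, model_window) entry per helix.
    """
    per_helix = []
    for natural_length in natural_lengths:
        target = min(natural_length, model_length)
        natural = truncation_windows(natural_length, target)
        model = truncation_windows(model_length, target)
        per_helix.append([(n, m) for n in natural for m in model])
    return product(*per_helix)


# Hypothetical natural bundle with helices of 7, 18, 15 and 12 residues.
n_fits = sum(1 for _ in paired_truncations([7, 18, 15, 12]))
print(n_fits)
```

For this hypothetical bundle the four helices contribute 9, 4, 1 and 4 windows
respectively, so 144 candidate alignments would each be fitted and the lowest crms kept.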

An important goal is to identify stacks with no natural counterparts as candidates for the
design of novel protein folds. To identify which stacks might be promising candidates, we
performed a designability calculation using a hydrophobic energy (see Methods) on the
ensemble of representatives of our four-helix structures. We used a random sample of
4,000,000 binary amino-acid sequences. Fig. 2 shows the results of the designability
calculation. The distribution of designabilities is consistent with previous results for
both lattice [29] and off-lattice models [17], [18] - namely, there is a small set of
highly designable structures with the great majority of structures poorly designable or
undesignable. The average designability, i.e. the average number of sequences per stack,
was 4,000,000/188,538 = 21. The most designable structure was the lowest energy state of
1813 sequences.

Almost all of the designable structures have an analog amongst the four-helix fold
families. The four most designable distinct folds are shown in order of designability in
Fig. 3. The most designable structure is an up-and-down four-helix bundle, the second most
designable fold is a variant of the up-and-down fold with a crossover connection, the third
most designable fold falls within the λ-repressor DNA-binding-domain class, and the last
fold is an orthogonal array [24]. Table II presents particular binary sequences which have
these structures as lowest energy folds. We obtained these sequences by matching them to
the surface-area pattern of each of the four folds and then introducing mutations to
maximize the energy gap. The energy gap was defined as the smallest energy difference to a
competing structure at a crms > 4 Angstroms (i.e. a structure with a different fold type).
Specifically, sequences were obtained by first calculating the mean surface-area exposure
of the side chains for each structure, and assigning a hydrophobic residue to each site
with surface exposure below the mean. Point and double mutations were then randomly
performed on the sequence by changing an H (hydrophobic) to a P (polar) or a P to an H, and
the mutation(s) was kept if the gap was made larger. This process of mutation was continued
until a sequence was obtained for which a mutation at any site made the gap smaller. The
last column in Table II lists the resulting energy gaps. Fig. 4(a) shows the pattern of
surface exposure along each helix for structure (a) of Fig. 3, along with the corresponding
HP pattern (red for hydrophobic, open for polar). Notice that the HP pattern of the
optimized sequence in Fig. 4(a) does not always follow the rule H at buried site, P at
exposed site. For sites which depart from the rule by having a hydrophobic residue on an
exposed site, we found that the nearest competing structure was even more exposed on that
site (e.g. site 14 of helix 2 and site 13 of helix 3). For the site that had a polar
residue on a buried site (site 12 of helix 2), the nearest competing structure was less
exposed on that site. Thus, it is sometimes beneficial to have hydrophobic residues exposed
and/or polar residues buried in order to "design-out" competing structures.

An important characteristic of natural proteins is their stability against mutations of
individual amino acids. Generally, it requires several mutations to cause a natural protein
to fail to fold. For our four most designable distinct folds, we have analyzed the
mutational stability of the optimized sequences (Table II). We find that a minimum of three
to five mutations are required to reduce the energy gap to zero. For structure (a) of
Fig. 3, four mutations are required to close the gap; the most effective sites for these
mutations are shown by arrows in Fig. 4.
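
The gap-maximizing mutation procedure used to obtain the optimized sequences can be
sketched in Python as follows. This is a minimal illustration under stated assumptions, not
the code used in this work: the function names are hypothetical, the competitor list is
assumed to have been filtered already to structures at crms > 4 Angstroms, point and double
mutations are assumed to be drawn with equal probability, and the hydrophobicities of 5 and
-1 (in units of $k_BT$) are the values quoted in Methods.

```python
import random

HYDROPHOBICITY = {"H": 5.0, "P": -1.0}  # in k_BT, values quoted in Methods


def energy(seq, exposure):
    """Hydrophobic energy: sum of hydrophobicity times fractional exposure."""
    return sum(HYDROPHOBICITY[a] * s for a, s in zip(seq, exposure))


def gap(seq, target, competitors):
    """Smallest energy difference between any competitor and the target."""
    return min(energy(seq, c) for c in competitors) - energy(seq, target)


def initial_sequence(target):
    """H at sites with below-mean exposure, P elsewhere."""
    mean = sum(target) / len(target)
    return ["H" if s < mean else "P" for s in target]


def flip(seq, sites):
    """Return a copy of seq with H and P swapped at the given sites."""
    out = list(seq)
    for i in sites:
        out[i] = "P" if out[i] == "H" else "H"
    return out


def optimize(target, competitors, rng, max_steps=10000):
    """Accept random point or double mutations only if they widen the gap."""
    seq = initial_sequence(target)
    best = gap(seq, target, competitors)
    n = len(seq)
    for _ in range(max_steps):
        # Converged when no single-site mutation increases the gap.
        if all(gap(flip(seq, [i]), target, competitors) <= best for i in range(n)):
            break
        trial = flip(seq, rng.sample(range(n), rng.choice([1, 2])))
        trial_gap = gap(trial, target, competitors)
        if trial_gap > best:
            seq, best = trial, trial_gap
    return "".join(seq), best


# Toy exposure patterns (hypothetical data, four sites only).
target = [0.1, 0.9, 0.2, 0.8]
competitors = [[0.9, 0.1, 0.8, 0.2], [0.5, 0.5, 0.5, 0.5]]
sequence, best_gap = optimize(target, competitors, random.Random(0))
print(sequence, best_gap)
```

Because only gap-increasing mutations are accepted, the search is a greedy hill climb and
may stop at a local optimum; restarting from several random seeds is one way to probe this.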

Not all of the highly designable structures identified by our method have close analogs
among known natural folds. We used a vector-based alignment tool called "Mammoth" [35] to
align the top 1000 designable structures against 4188 α-helical proteins from the SCOP
database. In Fig. 5, we show two of our designable structures (left) that had low alignment
scores, along with their closest analogs in the databank (right). The first structure,
shown in Fig. 5(a), is similar to the POU binding domain. The model structure was the 15th
most designable among the representative ensemble. Unlike the POU bundle, which has three
helices coiled with a left-handed twist, the model structure has the same three helices
coiled with a right-handed twist. We found no similar structure with a right-handed twist
in the databank. The second structure, shown in Fig. 5(b), is an orthogonal array, ranking
80th among the representative ensemble. The model structure's closest natural analog, 1AF7,
has a long turn connecting helix 1 to helix 2. In the model fold, helix 1 is reversed,
allowing it to connect to helix 2 with a short turn. These structures, and others with no
known natural counterparts, may be candidates for the design of novel folds.



III. DISCUSSION AND CONCLUSIONS 

We have presented a method for generating protein stacks by packing together fixed 
secondary structural elements. The method was used to generate an ensemble of stacks 
of four α-helices. Each of 11 natural structures, stripped of turns, was matched to within
3.6 Angstroms crms by a stack in the model ensemble, despite different helix lengths in the 
natural and model structures. The quantitative similarity between the model structures and 
the natural four-helix bundles suggests that the method is a reliable way of exploring the 
space of possible stacks. 

The designabilities of the generated stacks followed the previously observed pattern
[29, 17] - a small set of structures was highly designable, being lowest energy states of
many more than their share of sequences, while the majority of structures were poorly
ignable. The universality of this distribution of designabilities in model studies suggests that 
it may apply to real protein structures as well - some structures may be intrinsically much 
more designable than others. Also consistent with previous model studies, sequences which 
fold into highly designable structures were typically thermodynamically stable and stable 
against mutations. We found that a minimum of 3-5 mutations were required to destabilize 
optimized sequences for our most designable structures. Interestingly, the hydrophobic-polar 
patterns of these optimized sequences depart significantly from the simple rule hydrophobic 
at buried sites, polar at exposed sites. 

Almost all of the most designable four-helix stacks emerging from our model have analogs
among the known four-helix-bundle folds. This suggests that nature has found all, or nearly
all, designable four-helix bundles. However, several novel four-helix folds were identified
by our method. These are now the target of design.



V. METHODS

Generation of ensemble of stacks - The elements of the stack are chosen depending on
the size and type of protein desired. These elements can be α-helices and/or β-strands. The
number of each type of element is specified, as is the length in residues of each element.
The sequential arrangement of the elements along the protein chain is also specified, along
with the maximum length of the turns connecting elements.

Each element in the stack is assumed to be a rigid body, described by its center of mass
and three Euler angles. The same simplifying assumption has been employed previously by
Erman, Bahar, and Jernigan in their work on the packing of pairs of α-helices [36]. This
method also complements previous work which has looked at the packing of fixed secondary
structure [37-40]. An element, helix or strand, is specified by its backbone α-carbon-atom
positions and its amino-acid side-chain centroids, the latter taken to lie in the direction
of the β-carbon at a distance of 2.1 Angstroms from the α-carbon. Helices are constructed
using a helical periodicity of 3.6 residues and a helical rise of 1.5 Angstroms/residue.
Strands are created by using a single backbone dihedral-angle pair from the β-strand region
of the Ramachandran plot. A stack is generated by first randomly selecting the center of
mass and Euler angles for each element (if an element's center of mass and angles cause it
to violate self-avoidance with one of the other elements, then its degrees of freedom are
re-selected randomly). These variables are then relaxed so as to minimize the packing
energy (described in detail below). A local minimum of the packing energy is found using a
conjugate-gradient method, described in Numerical Recipes. This yields a stack. With the
centers of mass and angles determined, various symmetry operations are then performed to
generate additional stacks. For α-helical elements these are screw operations, which
correspond to rotating the helix by ±100 degrees and translating it by ±1.5 Angstroms along
the helix direction. For β-strands, slide operations correspond to translating each residue
up or down by one residue along the strand direction. Each stack is then checked to see if
it satisfies a set of supplied constraints. For instance, stacks that exceed a specified
total surface exposure or compactness measure, or have end-to-end distances of connected
elements which exceed some cut-off, are excluded from the set. If a stack satisfies the
constraints, it is added to the ensemble. Stacks are generated in this way until the
ensemble of possible stacks for this model is complete, as discussed below.
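
As an illustration of the helix geometry and the screw operation, the following Python
sketch builds the α-carbon trace of an ideal helix in its own local frame (axis along z)
and applies one screw step. The periodicity and rise are taken from the text; the helix
radius of 2.3 Angstroms is an assumed typical value that is not specified above, and the
function names are hypothetical.

```python
import math

RISE = 1.5      # Angstroms per residue (from the text)
PERIOD = 3.6    # residues per turn (from the text)
RADIUS = 2.3    # Angstroms; assumed alpha-carbon helix radius, not from the text
STEP = 2.0 * math.pi / PERIOD  # 100 degrees per residue


def ideal_helix(n_residues):
    """Alpha-carbon coordinates of an ideal helix along z, centred on z = 0."""
    z0 = -0.5 * RISE * (n_residues - 1)
    return [
        (RADIUS * math.cos(k * STEP), RADIUS * math.sin(k * STEP), z0 + k * RISE)
        for k in range(n_residues)
    ]


def screw(coords, steps=1):
    """Rotate about the helix axis by steps*100 degrees and shift by steps*1.5 A."""
    angle = steps * STEP
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z + steps * RISE) for x, y, z in coords]


helix = ideal_helix(15)
shifted = screw(helix, steps=1)
# After one screw step, residue k occupies the position residue k+1 had before.
print(max(math.dist(shifted[k], helix[k + 1]) for k in range(14)))
```

The printed deviation should be zero up to rounding, which is why a screw step moves the
helix along its own surface: the packing interface is preserved while the residues facing
the neighbouring elements change by one register.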

The choice of packing energy $E_{\rm packing}$ is motivated by the hydrophobic force, which
produces the compact stacks found in nature. The first term of the packing energy is

$$E_1 = \sum_i s_i, \tag{1}$$

where $s_i$ is the surface exposure to water of the $i$th residue along the chain. The
surface exposure of each residue is calculated by approximating each side-chain as a sphere
with radius $R_s = 3.1$ Angstroms centered at a distance $L = 2.1$ Angstroms from its
α-carbon atom, in the direction of the β-carbon. The surface exposure $s_i$ of each
side-chain sphere is found using the method of Flower [42], with a water molecule
represented as a sphere of radius $R_{\rm H_2O} = 1.4$ Angstroms.
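
To make the exposure term concrete, the sketch below estimates the fractional exposure of
each side-chain sphere numerically. It does not implement the method of Flower; it swaps in
a simple point-sampling estimate (in the spirit of the Shrake-Rupley approach) as a
stand-in, and the function names and the number of sample points are assumptions. The
sphere and probe radii are those given in the text.

```python
import math

R_SIDE = 3.1   # side-chain sphere radius, Angstroms (from the text)
R_WATER = 1.4  # water probe radius, Angstroms (from the text)


def sphere_points(n):
    """Roughly uniform points on the unit sphere (golden-spiral construction)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        points.append((r * math.cos(golden * k), r * math.sin(golden * k), z))
    return points


def fractional_exposure(centroids, n_points=200):
    """Fraction of each probe-expanded sphere lying outside all other spheres."""
    reach = R_SIDE + R_WATER
    unit = sphere_points(n_points)
    exposures = []
    for i, (cx, cy, cz) in enumerate(centroids):
        free = 0
        for ux, uy, uz in unit:
            p = (cx + reach * ux, cy + reach * uy, cz + reach * uz)
            if all(math.dist(p, c) >= reach
                   for j, c in enumerate(centroids) if j != i):
                free += 1
        exposures.append(free / n_points)
    return exposures


def exposure_sum(centroids):
    """Sum of fractional exposures, a dimensionless analog of Eq. (1)."""
    return sum(fractional_exposure(centroids))


print(fractional_exposure([(0.0, 0.0, 0.0)]))
print(fractional_exposure([(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]))
```

Multiplying each fraction by $4\pi (R_s + R_{\rm H_2O})^2$ would convert it to an
accessible area; whether Eq. (1) uses areas or fractions is not stated, so the sketch
leaves the values dimensionless.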

We add to this hydrophobic energy a second term which represents the effect of excluded
volume. This term $E_2$ is a pairwise repulsive energy among backbone α-carbon atoms and
side-chain centroids on different elements. The excluded volume energy is given by

* ,3 1,3 1,3 



(2) 



where $R_\alpha = 1.75$ Angstroms and $R_\beta = 2.25$ Angstroms are sphere sizes for the
backbone α-carbon atoms and side-chain centroids, respectively, $r^{\alpha}_{ij}$ is the
distance between backbone α-carbon atoms $i$ and $j$, $r^{\beta}_{ij}$ is the distance
between centroids $i$ and $j$, and $r^{\alpha\beta}_{ij}$ is the distance between backbone
α-carbon atom $i$ and centroid $j$. $V_0$ sets the scale of the repulsive energy.

Lastly, we include a weak compression energy $E_3$ and an energy $E_4$ due to tethers
between the ends of connected elements. These energies have the form

$$E_3 = \frac{K}{2} r_g^2, \tag{3}$$

where $r_g$ is the radius of gyration of the entire stack², and

$$E_4 = \frac{K_t}{2} \sum_{\langle i,j \rangle} (d_{ij} - d^0_{ij})^2 \, \theta(d_{ij} - d^0_{ij}), \tag{4}$$

where $d_{ij}$ is the distance between the connected ends of tethered elements $i$ and $j$,
$d^0_{ij}$ is a specified equilibrium length (for the case of the helices above we used 12
Angstroms), and $\theta$ is a step function that is 0 if $d_{ij} < d^0_{ij}$ and 1
otherwise. The spring constants $K$ and $K_t$ are chosen to be small so that these terms
act as weak perturbations.
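
The two weak terms can be written out in a few lines of Python. The sketch assumes the
harmonic forms $E_3 = (K/2) r_g^2$ and $E_4 = (K_t/2) \sum (d - d^0)^2 \theta(d - d^0)$,
with $r_g$ computed from the side-chain centroids. The numerical values of the spring
constants are placeholders (the text says only that they are small), and the function names
are hypothetical.

```python
import math

K_COMPRESS = 0.01  # assumed placeholder; the text only says K is small
K_TETHER = 0.01    # assumed placeholder; the text only says K_t is small
D0 = 12.0          # equilibrium tether length in Angstroms (from the text)


def radius_of_gyration_sq(centroids):
    """Mean squared distance of the centroids from their center of mass."""
    n = len(centroids)
    cm = [sum(c[a] for c in centroids) / n for a in range(3)]
    return sum(math.dist(c, cm) ** 2 for c in centroids) / n


def compression_energy(centroids, k=K_COMPRESS):
    """E_3: weak harmonic pull towards a compact stack."""
    return 0.5 * k * radius_of_gyration_sq(centroids)


def tether_energy(end_pairs, k_t=K_TETHER, d0=D0):
    """E_4: harmonic penalty applied only when connected ends are farther than d0."""
    total = 0.0
    for end_a, end_b in end_pairs:
        d = math.dist(end_a, end_b)
        if d >= d0:  # step function: 0 below d0, 1 otherwise
            total += 0.5 * k_t * (d - d0) ** 2
    return total


print(compression_energy([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]))
print(tether_energy([((0.0, 0.0, 0.0), (14.0, 0.0, 0.0))]))
```

The one-sided tether penalizes only over-stretched connections, so helices whose ends are
already within 12 Angstroms feel no force from this term.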

The actual minimization of the total energy $E_{\rm packing} = E_1 + E_2 + E_3 + E_4$ using
the conjugate-gradient method proceeds in steps, akin to annealing. The scheduled parameter
is $V_0$. Initially $V_0$ is chosen to be large, so that there is a large repulsion between
all the elements. (The starting value of $V_0$ varies depending on the number and size of
the chosen elements. The initial $V_0$ is chosen so as to generate a smooth collapse of the
elements. For the case of four-helix bundles we chose a starting $V_0$ of 35.0.) At a given
$V_0$, a minimum of $E_{\rm packing}$ is found for the full set of center-of-mass and angle
variables. $V_0$ is then reduced by a constant factor (90%) and a small random change is
made to each degree of freedom. (The size of the random "kick" is also scaled along with
$V_0$, with the initial kick being 1 Angstrom for the centers of mass and 15 degrees for
each Euler angle.) The $V_0$ schedule is terminated when any two centroids are at a
distance less than some specified contact distance, taken to be $2R_\beta$. At this point,
$E_3$ and $E_4$ are set to zero, leaving only $E_1$ and $E_2$ to be minimized in the last
conjugate-gradient step. $V_0$ is then set to its final value³ and the last
conjugate-gradient minimization is performed to yield final values of each rigid element's
center of mass and orientation angles.

² $r_g^2 = \frac{1}{N} \sum_j (\mathbf{R}_{\rm cm} - \mathbf{r}_j)^2$, where
$\mathbf{R}_{\rm cm}$ is the center of mass of the entire stack and $\mathbf{r}_j$ is the
position of centroid $j$.
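
The annealing-like schedule can be summarized as a short control loop. The sketch below is
schematic only: the minimizer and the closest-centroid-distance function are supplied by
the caller (both hypothetical callbacks), the "constant factor (90%)" is assumed to mean
multiplying $V_0$ by 0.9 at each step, the kicks are assumed to be drawn uniformly, and the
final stage (switching off $E_3$ and $E_4$ and resetting $V_0$) is left out.

```python
import random


def anneal(state, minimize, closest_centroid_distance, rng,
           v0_start=35.0, factor=0.9, kick_trans=1.0, kick_angle=15.0,
           contact=4.5, max_rounds=500):
    """Lower V0 step by step until two centroids come within the contact distance.

    state: one [x, y, z, angle1, angle2, angle3] list per rigid element.
    minimize(state, v0): returns a locally minimized state (caller-supplied).
    closest_centroid_distance(state): smallest centroid separation (caller-supplied).
    contact: 2 * R_beta = 4.5 Angstroms with R_beta = 2.25 from the text.
    """
    v0 = v0_start
    for _ in range(max_rounds):
        state = minimize(state, v0)
        if closest_centroid_distance(state) < contact:
            break
        v0 *= factor
        scale = v0 / v0_start  # kick size shrinks along with V0
        state = [
            [q + rng.uniform(-1.0, 1.0) * (kick_trans if k < 3 else kick_angle) * scale
             for k, q in enumerate(element)]
            for element in state
        ]
    return state, v0


# Stand-in callbacks, used only to exercise the control flow.
calls = []


def fake_minimize(state, v0):
    return state


def fake_distance(state):
    calls.append(1)
    return 10.0 - len(calls)


start = [[0.0] * 6 for _ in range(4)]
final_state, final_v0 = anneal(start, fake_minimize, fake_distance, random.Random(0))
print(final_v0)
```

In a real run `minimize` would wrap a conjugate-gradient routine acting on the packing
energy, and the returned state would then go through the final minimization described
above.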

Flexible elements - The method described above can be generalized to allow flexibility
of the secondary structural elements. In natural protein structures, α-helices are
relatively rigid, while β-strands are more flexible. Hence, the extension of the method to
include flexible elements is more important in the case of β-strands.

The flexural modes of rod-shaped objects are bending, stretching, and twisting. All these
internal flexural modes can be included in the generation of stacks for both α-helices and
β-strands. It is possible to determine the appropriate degree of flexibility for each
internal mode by reference to known protein structures. A harmonic energy function
$E_{\rm flex}$ for these flexural modes can then be added to the packing energy, with
coefficients chosen to reproduce the degree of flexibility observed in natural proteins.
For example, if the degree of bending of an α-helix is represented by the angle $\theta$,
then the additional term in $E_{\rm packing}$ representing this mode would be

$$E_\theta = \frac{c_\theta}{2} \theta^2, \tag{5}$$

where the constant $c_\theta$ can be chosen so that the average degree of bending
$\langle \theta^2 \rangle$ in the generated stacks matches that observed in natural
structures. In the current work, however, we focus on α-helical proteins and only rigid
elements are considered.

Hydrogen bonding - In natural proteins, β-strands are typically stabilized by the formation
of hydrogen bonds between strands. To generate stack configurations which include strands
it is therefore important to include an inter-strand hydrogen-bonding energy $E_{\rm hb}$
in the packing energy $E_{\rm packing}$. One form of a hydrogen-bonding energy function is given in

n. 

Completeness of stack ensemble - Designability is determined via a competition for
amino-acid sequences within a complete set of stacks. Since the method for generating
stacks is based on random sampling, a criterion must be specified for when to stop
sampling. We stop the generation of structures when a specified fraction of newly generated
stacks already occurs in the previously generated ensemble. If the fraction is not
satisfied, the newly generated structures are added to the ensemble, and more stacks are
randomly generated. We use crms to measure similarity between the ensemble and the newly
generated structures, and consider two structures to be similar if their crms is less than
1.5 Angstroms. The distance measure, crms, is defined as

$$({\rm crms})^2 = \frac{1}{N} \sum_i (\mathbf{r}_i^{s} - \mathbf{r}_i^{s'})^2, \tag{6}$$

where $\mathbf{r}_i^{s}$ ($\mathbf{r}_i^{s'}$) is the position of the $i$th α-carbon for
the $s$ ($s'$) stack and $N$ is the number of backbone α-carbons. The stacks $s$ and $s'$
are aligned by performing a least-squares fit using crms as the metric. We demand that 95%
of the newly generated structures be similar to one of the structures in the ensemble
before stopping the structure generation procedure.

³ The final value of $V_0$ is determined by a fitting procedure involving a naturally
occurring stack composed of similar elements. Specifically, $V_0$ is chosen to minimize the
crms distance between the stack before and after a conjugate-gradient minimization, with
fixed $E_1$ and $E_2$. For the four-helix bundle we found that a value of $V_0 = 0.05$
produced the best fits to the chosen SCOP structures. $V_0$ controls the inter-helical
separation, and thus changing it by a few percent only serves to increase or decrease the
contact distances of helices. Making $V_0$ significantly different from this makes the
side-chain spheres unphysically small or large, which can lead to unreasonable packings.
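
The stopping rule for sampling can be expressed compactly in Python. In this sketch the
crms of Eq. (6) is evaluated on coordinates that are assumed to be already superimposed
(the least-squares alignment step is omitted), `generate_batch` is a hypothetical callback
returning newly generated stacks, and the cap on the number of batches is an added
safeguard rather than part of the procedure.

```python
import math


def crms(a, b):
    """Eq. (6) for two equal-length, already aligned lists of 3D positions."""
    n = len(a)
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / n)


def fraction_already_seen(new_batch, ensemble, cutoff=1.5, distance=crms):
    """Fraction of new stacks within `cutoff` of some stack in the ensemble."""
    if not new_batch:
        raise ValueError("new_batch must not be empty")
    seen = sum(
        1 for s in new_batch if any(distance(s, e) < cutoff for e in ensemble)
    )
    return seen / len(new_batch)


def grow_until_complete(generate_batch, ensemble, required=0.95, max_batches=1000):
    """Add batches until `required` of a new batch is already represented."""
    for _ in range(max_batches):
        batch = generate_batch()
        if fraction_already_seen(batch, ensemble) >= required:
            return True
        ensemble.extend(batch)
    return False


# Toy check with single-atom "stacks" (hypothetical data).
print(fraction_already_seen([[(0.5, 0.0, 0.0)], [(10.0, 0.0, 0.0)]],
                            [[(0.0, 0.0, 0.0)]]))
```

The pairwise comparison costs time proportional to the batch size times the ensemble size,
which is one practical reason to cluster the ensemble before the designability
calculation.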

Clustering - Many of the randomly generated stacks form clusters of closely related
structures. It is computationally advantageous to reduce the sample by retaining only one
member of each cluster. These representative structures are selected in the following way.
The entire set of stacks is sorted according to total surface exposure, i.e. from most
compact to least compact. Starting at the top of this list with the most compact stack, we
eliminate all stacks that are closer to it than 1.5 Angstroms crms. This process is
repeated for the next most compact structure in the list until the end of the list is
reached. We can typically compress the large ensemble of structures by a factor of 3-5 in
this way.
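
The selection of representatives amounts to a greedy pass over the sorted list, as the
following Python sketch shows. The function name is hypothetical, and the distance function
is passed in by the caller so that any crms implementation can be used; the toy example
uses plain numbers in place of structures.

```python
def select_representatives(stacks, exposure, distance, cutoff=1.5):
    """Greedy clustering: keep the most compact stack, drop its near neighbors.

    stacks: list of structures; exposure: total surface exposure of each stack;
    distance(a, b): dissimilarity between two stacks (crms in the text).
    Returns the indices of the retained representative stacks.
    """
    order = sorted(range(len(stacks)), key=lambda i: exposure[i])
    removed = [False] * len(stacks)
    representatives = []
    for position, i in enumerate(order):
        if removed[i]:
            continue
        representatives.append(i)
        # Eliminate every remaining stack closer than the cutoff to this one.
        for j in order[position + 1:]:
            if not removed[j] and distance(stacks[i], stacks[j]) < cutoff:
                removed[j] = True
    return representatives


# Toy example: numbers stand in for structures, absolute difference for crms.
toy_stacks = [0.0, 1.0, 2.0, 5.0]
toy_exposure = [3.0, 1.0, 2.0, 4.0]
print(select_representatives(toy_stacks, toy_exposure, lambda a, b: abs(a - b)))
```

In the toy example the stack at 1.0 is the most compact, so it absorbs its neighbors at 0.0
and 2.0, leaving indices 1 and 3 as representatives.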

Designabilities of stacks - The designabilities of the representative stacks, after
clustering, are determined by allowing the structures to compete for a random sample of
possible amino-acid sequences. The "designability" of a stack is defined as the number of
sequences for which that stack has the lowest energy. We assume that the hydrophobic energy
is the dominant term contributing to the energy of a sequence on a given structure. This
energy is given by

$$E_h = \sum_i h_i s_i, \tag{7}$$

where $h_i$ is the hydrophobicity of the $i$th element of the sequence and $s_i$ is the
fractional surface exposure of the $i$th side-chain sphere in the particular stack. For
each sequence considered, the lowest energy stack in the representative ensemble is
determined. By sampling a large number of randomly selected sequences, it is possible to
reliably estimate the relative designabilities of different stacks.
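
A minimal Python sketch of this competition is given below. It assumes binary sequences
with the two hydrophobicity values quoted later in this section (5 and -1 in units of
$k_BT$), drawn with equal probability at each site. Sequences whose lowest energy is shared
by more than one stack are not counted; that tie rule is an assumption of the sketch, since
the text does not say how ties are handled. Function names are hypothetical.

```python
import random
from collections import Counter

H_HYDROPHOBIC = 5.0  # k_BT, value quoted later in this section
H_POLAR = -1.0       # k_BT, value quoted later in this section


def lowest_energy_stack(h, exposures):
    """Index of the stack minimizing E_h = sum_i h_i * s_i, or None on a tie."""
    energies = [sum(hi * si for hi, si in zip(h, s)) for s in exposures]
    best = min(energies)
    winners = [k for k, e in enumerate(energies) if e == best]
    return winners[0] if len(winners) == 1 else None


def designabilities(exposures, n_sequences, rng):
    """Count, for each stack, the sampled sequences for which it is the unique minimum."""
    n_sites = len(exposures[0])
    counts = Counter()
    for _ in range(n_sequences):
        h = [rng.choice((H_HYDROPHOBIC, H_POLAR)) for _ in range(n_sites)]
        winner = lowest_energy_stack(h, exposures)
        if winner is not None:
            counts[winner] += 1
    return [counts[k] for k in range(len(exposures))]


# Toy ensemble of three two-site "stacks" (hypothetical exposure patterns).
toy = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
print(designabilities(toy, 1000, random.Random(1)))
```

In the toy ensemble the uniformly half-exposed third stack never wins any sequence, a small
analog of the undesignable structures that dominate Fig. 2.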

For the designability calculation, we employed binary sequences consisting of only two
types of amino acids. Such sequences are also known as "HP-sequences" for hydrophobic (H)
and polar (P) amino acids. In previous studies, we found only minimal differences in the
designabilities of top structures when binary sequences and sequences with a continuous
distribution of hydrophobicities were used [17]. The two hydrophobicity values can be
written as $h_i = h_0 \pm \delta h$, where $h_0$ is a compactification energy, and
$\delta h$ measures the relative difference between hydrophobic and polar residues. From
the Miyazawa-Jernigan matrix of amino-acid interaction energies, we infer a typical energy
difference between hydrophobic and polar residues of $1.5\,k_BT$/contact. On average a
buried residue makes four non-covalent contacts; therefore we take $2\delta h = 6.0\,k_BT$.
The compactification energy $h_0$ was determined by fitting the surface-area distribution
of the set of 11 natural four-helix bundles given in Results to the surface-area
distributions for the 100 most designable four-helix stacks, using different values of
$h_0$ to assess designability. The best fit is shown in Fig. 6, and this corresponded to
$h_0 = 2\,k_BT$. Thus in our model hydrophobic residues have a hydrophobicity of $5\,k_BT$
and polar residues $-1\,k_BT$.
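
For reference, the arithmetic behind these two values can be written out explicitly, using
only the numbers quoted above (the subscripts H and P are introduced here for
illustration):

```latex
\begin{align*}
2\,\delta h &= (1.5\,k_BT/\text{contact}) \times (4\ \text{contacts}) = 6.0\,k_BT
  \quad\Rightarrow\quad \delta h = 3.0\,k_BT, \\
h_{\rm H} &= h_0 + \delta h = 2\,k_BT + 3\,k_BT = 5\,k_BT, \\
h_{\rm P} &= h_0 - \delta h = 2\,k_BT - 3\,k_BT = -1\,k_BT.
\end{align*}
```

With $E_h = \sum_i h_i s_i$, exposing a hydrophobic residue therefore raises the energy,
while exposing a polar residue lowers it slightly.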

If flexible α-helices and/or β-strands are employed in generating stacks, the energy
$E_{\rm flex}$ associated with the flexural modes can be added to the hydrophobic energy
$E_h$. Similarly, if inter-strand hydrogen bonding is included, $E_{\rm hb}$ can be added
as well. The energies $E_{\rm flex}$ and $E_{\rm hb}$ add a sequence-independent
contribution to each stack.



TABLES



PDB ID    crms (Angstroms)
1FLX      2.96
1FFH      3.54
1E6I      2.85
1CB1      1.65
1CEI      2.95
1A24      2.85
1POU      2.81
1AU7      3.02
1EH2      2.74
1IMQ      2.75
1DNY      3.44

TABLE I. Results of fitting selected set of 11 proteins from SCOP database to ensemble of
model four-helix bundles.



Structure    Sequence           Energy Gap (k_BT)    Minimum Mutations
a helix 1    PPHHHHHHPHHPPHH    6.65                 4
a helix 2    HHPPHHPHHPHPHHP
a helix 3    PPHHPPHHPHHHHHH
a helix 4    PHHPPHHPHHPHPHP

b helix 1    HHPPHHPHHPHHHHP    5.85                 3
b helix 2    HHHPPHHPHHHPHHP
b helix 3    PPHHHHHPHPPPHHP
b helix 4    HPHHHPHHPHHPHHH

c helix 1    HHHHPPHPPPHPPHP    8.3                  5
c helix 2    PHPPHHPPPHPPHHP
c helix 3    PHPPHHHPPHHPPPP
c helix 4    PPPPHPPPHPPHHHH

d helix 1    HPHHHPHHPPHHHPP    4.70                 3
d helix 2    PHHPHHHPHHPPPHP
d helix 3    PHHPHHHPHHPHHPP
d helix 4    PHHHPHHPHHHHHHH

TABLE II. Results for the four most designable distinct folds for the model four-helix
bundles shown in Fig. 3. Column 2 gives the optimized hydrophobic-polar patterning of each
of the length-15 helices. For these sequences, the third column gives the energy gap in
k_BT to the nearest distinct structural competitor. The last column gives the minimum
number of point mutations necessary to reduce the energy gap to zero.



FIGURES



FIG. 1. Representative fits for four SCOP proteins (left column) to model four-helix bundles
(right column): (a) fit for 1EH2 (crms = 2.74 Angstroms/residue), (b) fit for 1FFH (crms = 
3.54 Angstroms/residue), (c) fit for 1CEI (crms = 2.95 Angstroms/residue), (d) fit for 1POU 
(crms = 2.81 Angstroms/residue). Numbers indicate helix number and their location indicates the 
beginning of the given helix. 



FIG. 2. Histogram of the number of structures with a given designability for the representative 
structures of the four-helix-bundle ensemble. Only a few of the structures are highly designable, 
i.e. are lowest energy states of a large number of sequences. Most structures are lowest energy 
states of few or no sequences. 



FIG. 3. Four most designable distinct four-helix folds: (a) up-and-down fold, (b) up-and-down
with a cross-over connection fold, (c) λ-repressor-type fold, (d) orthogonal-array fold. Numbers
indicate helix number and their location indicates the beginning of the given helix.



FIG. 4. Surface-area exposure for each of the four helices for structure (a) in Fig. 3, colored
with the hydrophobic-polar pattern of the optimized sequence (red bar = hydrophobic, open bar
= polar). All sites with < 10% exposure are occupied by hydrophobic amino acids. Also shown
are the four mutation sites (arrows) which reduce the energy gap between this structure and its
competitor to zero (sites 2, 6 and 15 of helix 2 and site 3 of helix 3).



FIG. 5. Two designable four-helix folds with no known natural analogs. On the right are the
closest aligned naturally occurring folds [35], and on the left are the model structures. (a) 1POU
has a left-handed twist of the top three helices. The model structure has a right-handed twist of
these helices. (b) 1AF7 has a long turn connecting helix 1 to helix 2. The model structure has
helix 1 reversed, allowing a short turn between helix 1 and helix 2.



FIG. 6. Best fit of surface distribution of the 11 SCOP proteins to top 100 designable structures
found using $h_0 = 2\,k_BT$.